Introduction¶
Dr Ignaz Semmelweis was a Hungarian physician, born in 1818, who worked at the Vienna General Hospital. For centuries, people had attributed illness to "bad air" or evil spirits. In the 1800s, however, doctors began to study anatomy more closely, perform autopsies, and make arguments based on data. Dr Semmelweis suspected that something was wrong with the procedures at the Vienna General Hospital, and he wanted to find out why so many women in its maternity wards were dying of childbed fever (i.e., puerperal fever).
Today you will become Dr Semmelweis. This is your office 👆. You will step into Dr Semmelweis' shoes and analyse the same data collected from 1841 to 1849.
The Data Source¶
Dr Semmelweis published his research in 1861. I found the scanned pages of the full text with the original tables in German, but an excellent English translation can be found here.
Import Statements¶
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
Notebook Presentation¶
pd.options.display.float_format = '{:,.2f}'.format
# Create locators for ticks on the time axis
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
Read the Data¶
df_yearly = pd.read_csv('annual_deaths_by_clinic.csv')
# parse_dates avoids DateTime conversion later
df_monthly = pd.read_csv('monthly_deaths.csv',
parse_dates=['date'])
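To see what parse_dates does without needing the CSV files, here is a minimal sketch that reads a few rows from an in-memory string. The rows are copied from the monthly table shown later in this notebook; the names csv_text and sample are illustrative only.

```python
import io

import pandas as pd

# First rows of the monthly data, copied from the table in this notebook.
csv_text = """date,births,deaths
1841-01-01,254,37
1841-02-01,239,18
1841-03-01,277,12
"""

# parse_dates converts the 'date' column to datetime64 while reading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(sample.dtypes)
print(sample['date'].dt.year.unique())
```

Without parse_dates, the date column would be read as plain strings and would need a separate pd.to_datetime() call.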
Preliminary Data Exploration¶
Challenge: Check out these two DataFrames ☝️.
- What is the shape of df_yearly and df_monthly? How many rows and columns?
- What are the column names?
- Which years are included in the dataset?
- Are there any NaN values or duplicates?
- What was the average number of births that took place per month?
- What was the average number of deaths that took place per month?
df_yearly.shape
(12, 4)
df_yearly.sample(3)
| | year | births | deaths | clinic |
|---|---|---|---|---|
| 7 | 1842 | 2659 | 202 | clinic 2 |
| 2 | 1843 | 3060 | 274 | clinic 1 |
| 11 | 1846 | 3754 | 105 | clinic 2 |
df_yearly.describe()
| | year | births | deaths |
|---|---|---|---|
| count | 12.00 | 12.00 | 12.00 |
| mean | 1,843.50 | 3,152.75 | 223.33 |
| std | 1.78 | 449.08 | 145.38 |
| min | 1,841.00 | 2,442.00 | 66.00 |
| 25% | 1,842.00 | 2,901.75 | 100.25 |
| 50% | 1,843.50 | 3,108.50 | 219.50 |
| 75% | 1,845.00 | 3,338.25 | 263.50 |
| max | 1,846.00 | 4,010.00 | 518.00 |
df_yearly.dtypes
year int64 births int64 deaths int64 clinic object dtype: object
df_monthly.shape
(98, 3)
df_monthly.dtypes
date datetime64[ns] births int64 deaths int64 dtype: object
df_monthly.sample(3)
| | date | births | deaths |
|---|---|---|---|
| 82 | 1847-12-01 | 273 | 8 |
| 39 | 1844-05-01 | 240 | 14 |
| 2 | 1841-03-01 | 277 | 12 |
df_monthly.describe()
| | date | births | deaths |
|---|---|---|---|
| count | 98 | 98.00 | 98.00 |
| mean | 1845-02-11 04:24:29.387755008 | 267.00 | 22.47 |
| min | 1841-01-01 00:00:00 | 190.00 | 0.00 |
| 25% | 1843-02-08 00:00:00 | 242.50 | 8.00 |
| 50% | 1845-02-15 00:00:00 | 264.00 | 16.50 |
| 75% | 1847-02-22 00:00:00 | 292.75 | 36.75 |
| max | 1849-03-01 00:00:00 | 406.00 | 75.00 |
| std | NaN | 41.77 | 18.14 |
df_monthly
| | date | births | deaths |
|---|---|---|---|
| 0 | 1841-01-01 | 254 | 37 |
| 1 | 1841-02-01 | 239 | 18 |
| 2 | 1841-03-01 | 277 | 12 |
| 3 | 1841-04-01 | 255 | 4 |
| 4 | 1841-05-01 | 255 | 2 |
| ... | ... | ... | ... |
| 93 | 1848-11-01 | 310 | 9 |
| 94 | 1848-12-01 | 373 | 5 |
| 95 | 1849-01-01 | 403 | 9 |
| 96 | 1849-02-01 | 389 | 12 |
| 97 | 1849-03-01 | 406 | 20 |
98 rows × 3 columns
Check for NaN Values and Duplicates¶
df_yearly.isnull().sum().sum()
0
df_monthly.isnull().sum().sum()
0
# No NaN values in either DataFrame
df_yearly.duplicated().sum()
0
df_monthly.duplicated().sum()
0
df_monthly.duplicated().values.any()
False
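Because both DataFrames are clean, the checks above only ever return 0 or False. As a hedged illustration of what they report when data is messy, here is a small sketch on a toy DataFrame. The values are made up for the example and are not hospital records.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing cell and one repeated row.
toy = pd.DataFrame({
    'births': [254, 239, np.nan, 239],
    'deaths': [37, 18, 12, 18],
})

missing_per_column = toy.isna().sum()          # NaN count for each column
total_missing = int(missing_per_column.sum())  # NaN count for the whole frame
duplicate_rows = int(toy.duplicated().sum())   # rows identical to an earlier row

print(missing_per_column)
print(total_missing, duplicate_rows)
```

Looking at the per-column counts first makes it easier to see where missing values sit before collapsing everything into a single number.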
Percentage of Women Dying in Childbirth¶
Challenge: How dangerous was childbirth in the 1840s in Vienna?
- Using the annual data, calculate the percentage of women giving birth who died throughout the 1840s at the hospital.
In comparison, the United States recorded 18.5 maternal deaths per 100,000 or 0.018% in 2013 (source).
df_yearly.describe()
| | year | births | deaths |
|---|---|---|---|
| count | 12.00 | 12.00 | 12.00 |
| mean | 1,843.50 | 3,152.75 | 223.33 |
| std | 1.78 | 449.08 | 145.38 |
| min | 1,841.00 | 2,442.00 | 66.00 |
| 25% | 1,842.00 | 2,901.75 | 100.25 |
| 50% | 1,843.50 | 3,108.50 | 219.50 |
| 75% | 1,845.00 | 3,338.25 | 263.50 |
| max | 1,846.00 | 4,010.00 | 518.00 |
df_yearly['deaths'] / (df_yearly['births'] + df_yearly['deaths']) * 100
0 7.24 1 13.61 2 8.22 3 7.61 4 6.46 5 10.27 6 3.40 7 7.06 8 5.65 9 2.25 10 2.00 11 2.72 dtype: float64
annual_statistics = df_yearly.groupby('year').sum().copy()
annual_statistics['percentage_of_death'] = (annual_statistics['deaths'] / (annual_statistics['births'] + annual_statistics['deaths']) * 100)
annual_statistics
| year | births | deaths | clinic | percentage_of_death |
|---|---|---|---|---|
| 1841 | 5478 | 323 | clinic 1clinic 2 | 5.57 |
| 1842 | 5946 | 720 | clinic 1clinic 2 | 10.80 |
| 1843 | 5799 | 438 | clinic 1clinic 2 | 7.02 |
| 1844 | 6113 | 328 | clinic 1clinic 2 | 5.09 |
| 1845 | 6733 | 307 | clinic 1clinic 2 | 4.36 |
| 1846 | 7764 | 564 | clinic 1clinic 2 | 6.77 |
annual_statistics.describe()
| | births | deaths | percentage_of_death |
|---|---|---|---|
| count | 6.00 | 6.00 | 6.00 |
| mean | 6,305.50 | 446.67 | 6.60 |
| std | 826.75 | 165.79 | 2.29 |
| min | 5,478.00 | 307.00 | 4.36 |
| 25% | 5,835.75 | 324.25 | 5.21 |
| 50% | 6,029.50 | 383.00 | 6.17 |
| 75% | 6,578.00 | 532.50 | 6.96 |
| max | 7,764.00 | 720.00 | 10.80 |
# Angela's solution
prob = df_yearly.deaths.sum() / df_yearly.births.sum() * 100
print(f'Chances of dying in the 1840s in Vienna: {prob:.3}%')
Chances of dying in the 1840s in Vienna: 7.08%
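The cells in this section use two different denominators: the per-row calculations divide deaths by births + deaths, while the overall figure divides deaths by births alone. The sketch below compares the two on the annual totals, using the figures from the df_yearly table in this notebook. It assumes nothing about which definition the original records intended; the variable names are illustrative.

```python
import pandas as pd

# Annual figures copied from the df_yearly table in this notebook.
yearly = pd.DataFrame({
    'births': [3036, 3287, 3060, 3157, 3492, 4010,
               2442, 2659, 2739, 2956, 3241, 3754],
    'deaths': [237, 518, 274, 260, 241, 459,
               86, 202, 164, 68, 66, 105],
})

total_births = yearly['births'].sum()
total_deaths = yearly['deaths'].sum()

# Definition A: deaths per birth.
rate_per_birth = total_deaths / total_births * 100

# Definition B: deaths divided by births plus deaths.
rate_births_plus_deaths = total_deaths / (total_births + total_deaths) * 100

print(f'A: {rate_per_birth:.2f}%  B: {rate_births_plus_deaths:.2f}%')
```

Definition B is always the smaller of the two, so it is worth sticking to one definition when comparing numbers across cells.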
Visualise the Total Number of Births 🤱 and Deaths 💀 over Time¶
Plot the Monthly Data on Twin Axes¶
Challenge: Create a Matplotlib chart with twin y-axes. It should look something like this:
- Format the x-axis using locators for the years and months (Hint: we did this in the Google Trends notebook)
- Set the range on the x-axis so that the chart lines touch the y-axes
- Add gridlines
- Use skyblue and crimson for the line colours
- Use a dashed line style for the number of deaths
- Change the line thickness to 3 and 2 for the births and deaths respectively.
- Do you notice anything in the late 1840s?
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
years = mdates.YearLocator()
months = mdates.MonthLocator()
years_formatted = mdates.DateFormatter('%Y')
plt.figure(figsize=(14,8))
plt.title('Total Number of Monthly Births and Deaths', fontsize=18)
# Increase the size and rotate the labels on the x-axis
plt.xticks(fontsize=14, rotation=45)
plt.yticks(fontsize=14)
# plt.grid(True)
ax1 = plt.gca() # get current axes
# Use the full date range so the lines touch both y-axes
ax1.set(
    xlim=(df_monthly['date'].min(), df_monthly['date'].max())
)
ax1.grid(color='grey', linestyle='--')
ax2 = ax1.twinx() # create another axis that shares the same x-axis
ax1.xaxis.set_major_locator(years)
ax1.xaxis.set_major_formatter(years_formatted)
ax1.xaxis.set_minor_locator(months)
ax1.set_xlabel('Year',fontsize=18)
ax1.set_ylabel('Births', color='blue',fontsize=18)
ax2.set_ylabel('Deaths', color='red',fontsize=18)
ax1.plot(df_monthly['date'],
df_monthly['births'],
color='skyblue',
linewidth=3,
)
ax2.plot(df_monthly['date'], df_monthly['deaths'], '--', color='crimson', linewidth=2)  # dashed line for deaths
plt.show()
The Yearly Data Split by Clinic¶
Now let's look at the annual data instead.
Challenge: Use plotly to create line charts of the births and deaths of the two different clinics at the Vienna General Hospital.
- Which clinic is bigger or more busy judging by the number of births?
- Has the hospital had more patients over time?
- What was the highest number of deaths recorded in clinic 1 and clinic 2?
df_yearly
| | year | births | deaths | clinic |
|---|---|---|---|---|
| 0 | 1841 | 3036 | 237 | clinic 1 |
| 1 | 1842 | 3287 | 518 | clinic 1 |
| 2 | 1843 | 3060 | 274 | clinic 1 |
| 3 | 1844 | 3157 | 260 | clinic 1 |
| 4 | 1845 | 3492 | 241 | clinic 1 |
| 5 | 1846 | 4010 | 459 | clinic 1 |
| 6 | 1841 | 2442 | 86 | clinic 2 |
| 7 | 1842 | 2659 | 202 | clinic 2 |
| 8 | 1843 | 2739 | 164 | clinic 2 |
| 9 | 1844 | 2956 | 68 | clinic 2 |
| 10 | 1845 | 3241 | 66 | clinic 2 |
| 11 | 1846 | 3754 | 105 | clinic 2 |
df_yearly_by_clinic = df_yearly.groupby(['clinic','year']).sum()
df_yearly_by_clinic
| clinic | year | births | deaths |
|---|---|---|---|
| clinic 1 | 1841 | 3036 | 237 |
| | 1842 | 3287 | 518 |
| | 1843 | 3060 | 274 |
| | 1844 | 3157 | 260 |
| | 1845 | 3492 | 241 |
| | 1846 | 4010 | 459 |
| clinic 2 | 1841 | 2442 | 86 |
| | 1842 | 2659 | 202 |
| | 1843 | 2739 | 164 |
| | 1844 | 2956 | 68 |
| | 1845 | 3241 | 66 |
| | 1846 | 3754 | 105 |
df_yearly_by_clinic.shape
(12, 2)
df_yearly_by_clinic.index
MultiIndex([('clinic 1', 1841),
('clinic 1', 1842),
('clinic 1', 1843),
('clinic 1', 1844),
('clinic 1', 1845),
('clinic 1', 1846),
('clinic 2', 1841),
('clinic 2', 1842),
('clinic 2', 1843),
('clinic 2', 1844),
('clinic 2', 1845),
('clinic 2', 1846)],
names=['clinic', 'year'])
df_yearly_by_clinic.index[0]
('clinic 1', 1841)
df_yearly_by_clinic.query('clinic == "clinic 1"').sum()
births 20042 deaths 1989 dtype: int64
df_yearly_by_clinic.query('clinic == "clinic 2"').sum()
births 17791 deaths 691 dtype: int64
fig = px.line(df_yearly, x='year', y='births', color='clinic', title='Total Yearly Births by Clinic')
fig.show()
fig = px.line(df_yearly,
              x='year',
              y='deaths',
              color='clinic',
              title='Total Yearly Deaths by Clinic')
fig.show()
Calculate the Proportion of Deaths at Each Clinic¶
Challenge: Calculate the proportion of maternal deaths per clinic. That way we can compare like with like.
- Work out the percentage of deaths for each row in the df_yearly DataFrame by adding a column called "pct_deaths".
- Calculate the average maternal death rate for clinic 1 and clinic 2 (i.e., the total number of deaths per the total number of births).
- Create another plotly line chart to see how the percentage varies year over year with the two different clinics.
- Which clinic has a higher proportion of deaths?
- What is the highest yearly death rate in clinic 1 compared to clinic 2?
df_yearly['pct_deaths'] = (df_yearly['deaths'] / (df_yearly['births'] + df_yearly['deaths']) * 100)
df_yearly.describe()
| | year | births | deaths | pct_deaths |
|---|---|---|---|---|
| count | 12.00 | 12.00 | 12.00 | 12.00 |
| mean | 1,843.50 | 3,152.75 | 223.33 | 6.37 |
| std | 1.78 | 449.08 | 145.38 | 3.47 |
| min | 1,841.00 | 2,442.00 | 66.00 | 2.00 |
| 25% | 1,842.00 | 2,901.75 | 100.25 | 3.23 |
| 50% | 1,843.50 | 3,108.50 | 219.50 | 6.76 |
| 75% | 1,845.00 | 3,338.25 | 263.50 | 7.76 |
| max | 1,846.00 | 4,010.00 | 518.00 | 13.61 |
df_yearly
| | year | births | deaths | clinic | pct_deaths |
|---|---|---|---|---|---|
| 0 | 1841 | 3036 | 237 | clinic 1 | 7.24 |
| 1 | 1842 | 3287 | 518 | clinic 1 | 13.61 |
| 2 | 1843 | 3060 | 274 | clinic 1 | 8.22 |
| 3 | 1844 | 3157 | 260 | clinic 1 | 7.61 |
| 4 | 1845 | 3492 | 241 | clinic 1 | 6.46 |
| 5 | 1846 | 4010 | 459 | clinic 1 | 10.27 |
| 6 | 1841 | 2442 | 86 | clinic 2 | 3.40 |
| 7 | 1842 | 2659 | 202 | clinic 2 | 7.06 |
| 8 | 1843 | 2739 | 164 | clinic 2 | 5.65 |
| 9 | 1844 | 2956 | 68 | clinic 2 | 2.25 |
| 10 | 1845 | 3241 | 66 | clinic 2 | 2.00 |
| 11 | 1846 | 3754 | 105 | clinic 2 | 2.72 |
df_yearly_by_clinic = df_yearly.groupby(['clinic','year']).sum()
df_yearly_by_clinic.query('clinic == "clinic 1"').mean()
births 3,340.33 deaths 331.50 pct_deaths 8.90 dtype: float64
df_yearly_by_clinic.query('clinic == "clinic 2"').mean()
births 2,965.17 deaths 115.17 pct_deaths 3.85 dtype: float64
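The .mean() calls above average the six yearly percentages for each clinic. The challenge wording instead asks for total deaths over total births, which weights each year by its number of births. Here is a minimal sketch of that pooled calculation, using the annual figures from the df_yearly table above. The names yearly, totals and pooled_rate are illustrative.

```python
import pandas as pd

# Annual figures copied from the df_yearly table above.
yearly = pd.DataFrame({
    'clinic': ['clinic 1'] * 6 + ['clinic 2'] * 6,
    'births': [3036, 3287, 3060, 3157, 3492, 4010,
               2442, 2659, 2739, 2956, 3241, 3754],
    'deaths': [237, 518, 274, 260, 241, 459,
               86, 202, 164, 68, 66, 105],
})

# Sum births and deaths per clinic, then divide the totals.
totals = yearly.groupby('clinic')[['births', 'deaths']].sum()
pooled_rate = totals['deaths'] / totals['births'] * 100

print(pooled_rate.round(2))
```

Selecting only the numeric columns before .sum() also avoids concatenating the clinic strings. Note that this sketch divides by births alone, whereas the pct_deaths column above divides by births + deaths, so the two sets of figures are not directly comparable.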
df_yearly_by_clinic
| clinic | year | births | deaths | pct_deaths |
|---|---|---|---|---|
| clinic 1 | 1841 | 3036 | 237 | 7.24 |
| | 1842 | 3287 | 518 | 13.61 |
| | 1843 | 3060 | 274 | 8.22 |
| | 1844 | 3157 | 260 | 7.61 |
| | 1845 | 3492 | 241 | 6.46 |
| | 1846 | 4010 | 459 | 10.27 |
| clinic 2 | 1841 | 2442 | 86 | 3.40 |
| | 1842 | 2659 | 202 | 7.06 |
| | 1843 | 2739 | 164 | 5.65 |
| | 1844 | 2956 | 68 | 2.25 |
| | 1845 | 3241 | 66 | 2.00 |
| | 1846 | 3754 | 105 | 2.72 |
df_yearly_by_clinic.sort_values('pct_deaths',ascending=False)
| clinic | year | births | deaths | pct_deaths |
|---|---|---|---|---|
| clinic 1 | 1842 | 3287 | 518 | 13.61 |
| | 1846 | 4010 | 459 | 10.27 |
| | 1843 | 3060 | 274 | 8.22 |
| | 1844 | 3157 | 260 | 7.61 |
| | 1841 | 3036 | 237 | 7.24 |
| clinic 2 | 1842 | 2659 | 202 | 7.06 |
| clinic 1 | 1845 | 3492 | 241 | 6.46 |
| clinic 2 | 1843 | 2739 | 164 | 5.65 |
| | 1841 | 2442 | 86 | 3.40 |
| | 1846 | 3754 | 105 | 2.72 |
| | 1844 | 2956 | 68 | 2.25 |
| | 1845 | 3241 | 66 | 2.00 |
Plotting the Proportion of Yearly Deaths by Clinic¶
fig = px.line(df_yearly,
x='year',
y='pct_deaths',
color='clinic',
title='Percentage of deaths by Clinic')
fig.show()
The Effect of Handwashing¶
Dr Semmelweis made handwashing obligatory in the summer of 1847. In fact, he ordered people to wash their hands with chlorine (instead of water).
# Date when handwashing was made mandatory
handwashing_start = pd.to_datetime('1847-06-01')
Challenge:
- Add a column called "pct_deaths" to df_monthly that has the percentage of deaths per birth for each row.
- Create two subsets from the df_monthly data: before and after Dr Semmelweis ordered handwashing.
- Calculate the average death rate prior to June 1847.
- Calculate the average death rate after June 1847.
df_monthly.sample(3)
| | date | births | deaths |
|---|---|---|---|
| 8 | 1841-09-01 | 213 | 4 |
| 9 | 1841-10-01 | 236 | 26 |
| 66 | 1846-08-01 | 216 | 39 |
df_monthly['pct_deaths'] = (df_monthly['deaths']/(df_monthly['births']+df_monthly['deaths'])*100)
df_monthly_before = df_monthly.query('date < @handwashing_start').copy()
df_monthly_after = df_monthly.query('date >= @handwashing_start').copy()
df_monthly_after
| | date | births | deaths | pct_deaths |
|---|---|---|---|---|
| 76 | 1847-06-01 | 268 | 6 | 2.19 |
| 77 | 1847-07-01 | 250 | 3 | 1.19 |
| 78 | 1847-08-01 | 264 | 5 | 1.86 |
| 79 | 1847-09-01 | 262 | 12 | 4.38 |
| 80 | 1847-10-01 | 278 | 11 | 3.81 |
| 81 | 1847-11-01 | 246 | 11 | 4.28 |
| 82 | 1847-12-01 | 273 | 8 | 2.85 |
| 83 | 1848-01-01 | 283 | 10 | 3.41 |
| 84 | 1848-02-01 | 291 | 2 | 0.68 |
| 85 | 1848-03-01 | 276 | 0 | 0.00 |
| 86 | 1848-04-01 | 305 | 2 | 0.65 |
| 87 | 1848-05-01 | 313 | 3 | 0.95 |
| 88 | 1848-06-01 | 264 | 3 | 1.12 |
| 89 | 1848-07-01 | 269 | 1 | 0.37 |
| 90 | 1848-08-01 | 261 | 0 | 0.00 |
| 91 | 1848-09-01 | 312 | 3 | 0.95 |
| 92 | 1848-10-01 | 299 | 7 | 2.29 |
| 93 | 1848-11-01 | 310 | 9 | 2.82 |
| 94 | 1848-12-01 | 373 | 5 | 1.32 |
| 95 | 1849-01-01 | 403 | 9 | 2.18 |
| 96 | 1849-02-01 | 389 | 12 | 2.99 |
| 97 | 1849-03-01 | 406 | 20 | 4.69 |
df_monthly_after.describe()
| | date | births | deaths | pct_deaths |
|---|---|---|---|---|
| count | 22 | 22.00 | 22.00 | 22.00 |
| mean | 1848-04-16 07:38:10.909090816 | 299.77 | 6.45 | 2.05 |
| min | 1847-06-01 00:00:00 | 246.00 | 0.00 | 0.00 |
| 25% | 1847-11-08 12:00:00 | 265.00 | 3.00 | 0.95 |
| 50% | 1848-04-16 00:00:00 | 280.50 | 5.50 | 2.02 |
| 75% | 1848-09-23 12:00:00 | 311.50 | 9.75 | 2.96 |
| max | 1849-03-01 00:00:00 | 406.00 | 20.00 | 4.69 |
| std | NaN | 49.11 | 4.97 | 1.45 |
df_monthly_before.describe()
| | date | births | deaths | pct_deaths |
|---|---|---|---|---|
| count | 76 | 76.00 | 76.00 | 76.00 |
| mean | 1844-03-12 08:31:34.736842240 | 257.51 | 27.11 | 9.15 |
| min | 1841-01-01 00:00:00 | 190.00 | 1.00 | 0.52 |
| 25% | 1842-08-24 06:00:00 | 236.75 | 11.75 | 4.20 |
| 50% | 1844-03-16 12:00:00 | 254.50 | 26.50 | 9.52 |
| 75% | 1845-10-08 18:00:00 | 280.75 | 39.50 | 13.05 |
| max | 1847-05-01 00:00:00 | 336.00 | 75.00 | 23.89 |
| std | NaN | 34.28 | 17.94 | 5.62 |
# After handwashing, the average monthly pct_deaths fell from about 9% to about 2%. Impressive.
Calculate a Rolling Average of the Death Rate¶
Challenge: Create a DataFrame that has the 6 month rolling average death rate prior to mandatory handwashing.
Hint: You'll need to set the dates as the index in order to avoid the date column being dropped during the calculation.
df_monthly
| | date | births | deaths | pct_deaths |
|---|---|---|---|---|
| 0 | 1841-01-01 | 254 | 37 | 12.71 |
| 1 | 1841-02-01 | 239 | 18 | 7.00 |
| 2 | 1841-03-01 | 277 | 12 | 4.15 |
| 3 | 1841-04-01 | 255 | 4 | 1.54 |
| 4 | 1841-05-01 | 255 | 2 | 0.78 |
| ... | ... | ... | ... | ... |
| 93 | 1848-11-01 | 310 | 9 | 2.82 |
| 94 | 1848-12-01 | 373 | 5 | 1.32 |
| 95 | 1849-01-01 | 403 | 9 | 2.18 |
| 96 | 1849-02-01 | 389 | 12 | 2.99 |
| 97 | 1849-03-01 | 406 | 20 | 4.69 |
98 rows × 4 columns
rolling_avg = df_monthly_before.copy()
rolling_avg = rolling_avg.set_index('date')
rolling_avg
| date | births | deaths | pct_deaths |
|---|---|---|---|
| 1841-01-01 | 254 | 37 | 12.71 |
| 1841-02-01 | 239 | 18 | 7.00 |
| 1841-03-01 | 277 | 12 | 4.15 |
| 1841-04-01 | 255 | 4 | 1.54 |
| 1841-05-01 | 255 | 2 | 0.78 |
| ... | ... | ... | ... |
| 1847-01-01 | 311 | 10 | 3.12 |
| 1847-02-01 | 312 | 6 | 1.89 |
| 1847-03-01 | 305 | 11 | 3.48 |
| 1847-04-01 | 312 | 57 | 15.45 |
| 1847-05-01 | 294 | 36 | 10.91 |
76 rows × 3 columns
# Same approach as in the course, kept for reference (I prefer the explicit column version below):
# rolling_avg = rolling_avg.rolling(window=6,min_periods=6,on='pct_deaths',center=False).mean()
rolling_avg['rolling_average'] = rolling_avg['pct_deaths'].rolling(window=6).mean()
# roll_df
# rolling_avg = rolling_avg.dropna()
rolling_avg
| date | births | deaths | pct_deaths | rolling_average |
|---|---|---|---|---|
| 1841-01-01 | 254 | 37 | 12.71 | NaN |
| 1841-02-01 | 239 | 18 | 7.00 | NaN |
| 1841-03-01 | 277 | 12 | 4.15 | NaN |
| 1841-04-01 | 255 | 4 | 1.54 | NaN |
| 1841-05-01 | 255 | 2 | 0.78 | NaN |
| ... | ... | ... | ... | ... |
| 1847-01-01 | 311 | 10 | 3.12 | 9.80 |
| 1847-02-01 | 312 | 6 | 1.89 | 7.57 |
| 1847-03-01 | 305 | 11 | 3.48 | 6.05 |
| 1847-04-01 | 312 | 57 | 15.45 | 6.46 |
| 1847-05-01 | 294 | 36 | 10.91 | 6.66 |
76 rows × 4 columns
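The NaN values in the first rows appear because a rolling window needs a full set of observations before it can produce a mean. The sketch below shows this on the first five monthly rates from the table above, with a 3-month window chosen only to keep the example short (the notebook itself uses window=6). The names rates and rolling_3m are illustrative.

```python
import pandas as pd

# First five monthly death rates (percent) from the table above.
rates = pd.Series(
    [12.71, 7.00, 4.15, 1.54, 0.78],
    index=pd.to_datetime(['1841-01-01', '1841-02-01', '1841-03-01',
                          '1841-04-01', '1841-05-01']),
)

# Each value is the mean of the current month and the two months before it.
rolling_3m = rates.rolling(window=3).mean()

print(rolling_3m)
```

With window=3 the first two entries are NaN; with window=6 the first five are, which matches the table above.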
Highlighting Subsections of a Line Chart¶
Challenge: Copy-paste and then modify the Matplotlib chart from before to plot the monthly death rates (instead of the total number of births and deaths). The chart should look something like this:
- Add 3 separate lines to the plot: the death rate before handwashing, after handwashing, and the 6-month moving average before handwashing.
- Show the monthly death rate before handwashing as a thin dashed black line.
- Show the moving average as a thicker, crimson line.
- Show the rate after handwashing as a skyblue line with round markers.
- Look at the code snippet in the documentation to see how you can add a legend to the chart.
plt.figure(figsize=(14,8))
plt.title('Percentage of Monthly Deaths over Time', fontsize=18)
# Increase the size and rotate the labels on the x-axis
plt.xticks(fontsize=14, rotation=45)
plt.yticks(fontsize=14)
ax = plt.gca() # get current axes
ax.set(
xlim=(df_monthly['date'].min(),df_monthly['date'].max())
)
ax.grid(color='grey', linestyle='--') # also can type plt.grid()
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(years_formatted)
ax.xaxis.set_minor_locator(months)
ax.set_xlabel('Year',fontsize=18)
ax.set_ylabel('Percentage of Deaths', color='crimson',fontsize=18)
ma_line, = plt.plot(rolling_avg.index,
rolling_avg['rolling_average'],
linestyle='--',
color='red',
linewidth=3,
label='6m Moving Average',
)
bw_line, = plt.plot(df_monthly_before['date'],df_monthly_before['pct_deaths'],'-.',color='gray',label='Before Handwashing') #linestyle='--'
aw_line, = plt.plot(df_monthly_after['date'],df_monthly_after['pct_deaths'],'o-',color='skyblue',label='After Handwashing')
plt.legend(handles=[ma_line, bw_line, aw_line],
fontsize=18)
plt.show()
Statistics - Calculate the Difference in the Average Monthly Death Rate¶
Challenge:
- What was the average percentage of monthly deaths before handwashing?
- What was the average percentage of monthly deaths after handwashing was made obligatory?
- By how much did handwashing reduce the average chance of dying in childbirth in percentage terms?
- How do these numbers compare to the average for all the 1840s that we calculated earlier?
- How many times lower are the chances of dying after handwashing compared to before?
# After handwashing, the average monthly pct_deaths fell from about 9% to about 2% (computed earlier).
# Angela's solution (pct_deaths is already a percentage, so it is not multiplied by 100 again)
avg_prob_before = df_monthly_before.pct_deaths.mean()
print(f'Chance of death during childbirth before handwashing: {avg_prob_before:.3}%.')
avg_prob_after = df_monthly_after.pct_deaths.mean()
print(f'Chance of death during childbirth AFTER handwashing: {avg_prob_after:.3}%.')
mean_diff = avg_prob_before - avg_prob_after
print(f'Handwashing reduced the monthly proportion of deaths by {mean_diff:.3}%!')
times = avg_prob_before / avg_prob_after
print(f'This is a {times:.2}x improvement!')
Chance of death during childbirth before handwashing: 9.15%.
Chance of death during childbirth AFTER handwashing: 2.05%.
Handwashing reduced the monthly proportion of deaths by 7.11%!
This is a 4.5x improvement!
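The question "by how much in percentage terms" can be read two ways: as a drop in percentage points, or as a relative reduction of the original rate. The sketch below works through both, starting from the rounded means (9.15 and 2.05) reported by the describe() outputs earlier. Because it uses rounded inputs, the difference comes out slightly different from the figure computed on the full data.

```python
# Rounded means taken from the describe() outputs earlier in this notebook.
avg_before = 9.15  # percent, before mandatory handwashing
avg_after = 2.05   # percent, after mandatory handwashing

absolute_drop = avg_before - avg_after            # in percentage points
relative_drop = absolute_drop / avg_before * 100  # as a share of the original rate
ratio = avg_before / avg_after                    # how many times lower

print(f'{absolute_drop:.2f} points, {relative_drop:.1f}% lower, {ratio:.2f}x')
```

Stating the unit explicitly (points versus percent) avoids confusing a drop of about 7 percentage points with a 7% reduction.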
Use Box Plots to Show How the Death Rate Changed Before and After Handwashing¶
Challenge:
- Use NumPy's .where() function to add a column to df_monthly that shows if a particular date was before or after the start of handwashing.
- Then use plotly to create a box plot of the data before and after handwashing.
- How did key statistics like the mean, max, min, 1st and 3rd quartile change as a result of the new policy?
df_monthly['washing_hands'] = np.where(df_monthly.date < handwashing_start, 'No', 'Yes')
box = px.box(df_monthly,
x='washing_hands',
y='pct_deaths',
color='washing_hands',
title='How Have the Stats Changed with Handwashing?')
box.update_layout(xaxis_title='Washing Hands?',
yaxis_title='Percentage of Monthly Deaths',)
box.show()
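The labelling step with np.where can be checked on a handful of rows before applying it to the whole DataFrame. This is a minimal sketch using three months around the cutoff, with rates taken from the monthly table; the names cutoff and sample are illustrative.

```python
import numpy as np
import pandas as pd

cutoff = pd.Timestamp('1847-06-01')  # start of mandatory handwashing

# Three dates around the cutoff, taken from the monthly table.
sample = pd.DataFrame({
    'date': pd.to_datetime(['1847-05-01', '1847-06-01', '1847-07-01']),
    'pct_deaths': [10.91, 2.19, 1.19],
})

# np.where(condition, value_if_true, value_if_false) works element by element.
sample['washing_hands'] = np.where(sample['date'] < cutoff, 'No', 'Yes')

print(sample)
```

Because the comparison is a strict less-than, June 1847 itself is labelled 'Yes', which matches the >= used when the two subsets were created.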
Use Histograms to Visualise the Monthly Distribution of Outcomes¶
Challenge: Create a plotly histogram to show the monthly percentage of deaths.
- Use docs to check out the available parameters. Use the color parameter to display two overlapping histograms.
- The time period of handwashing is shorter than not handwashing. Change histnorm to percent to make the time periods comparable.
- Make the histograms slightly transparent.
- Experiment with the number of bins on the histogram. Which number works well in communicating the range of outcomes?
- Just for fun, display your box plot on the top of the histogram using the marginal parameter.
df_monthly
| | date | births | deaths | pct_deaths | washing_hands |
|---|---|---|---|---|---|
| 0 | 1841-01-01 | 254 | 37 | 12.71 | No |
| 1 | 1841-02-01 | 239 | 18 | 7.00 | No |
| 2 | 1841-03-01 | 277 | 12 | 4.15 | No |
| 3 | 1841-04-01 | 255 | 4 | 1.54 | No |
| 4 | 1841-05-01 | 255 | 2 | 0.78 | No |
| ... | ... | ... | ... | ... | ... |
| 93 | 1848-11-01 | 310 | 9 | 2.82 | Yes |
| 94 | 1848-12-01 | 373 | 5 | 1.32 | Yes |
| 95 | 1849-01-01 | 403 | 9 | 2.18 | Yes |
| 96 | 1849-02-01 | 389 | 12 | 2.99 | Yes |
| 97 | 1849-03-01 | 406 | 20 | 4.69 | Yes |
98 rows × 5 columns
histogram = px.histogram(df_monthly,
x='pct_deaths',
color='washing_hands',
histnorm='percent',
nbins=25,
# histfunc="count",
marginal="box",
)
histogram.update_traces(opacity=0.35) # transparent
# histogram.update_layout(barmode='stack')
histogram.update_layout(barmode='overlay') # does not sum
histogram.update_layout(xaxis_title='Percentage of Monthly Deaths',
                        yaxis_title='Percentage of Months',)  # histnorm='percent', so the y-axis is not a raw count
histogram.show()
Use a Kernel Density Estimate (KDE) to visualise a smooth distribution¶
Challenge: Use Seaborn's .kdeplot() to create two kernel density estimates of the pct_deaths, one for before handwashing and one for after.
- Use the shade parameter to give your two distributions different colours (recent Seaborn versions call this parameter fill).
- What weakness in the chart do you see when you just use the default parameters?
- Use the clip parameter to address the problem.
sns.kdeplot(data=df_monthly, x='pct_deaths',hue='washing_hands')
<Axes: xlabel='pct_deaths', ylabel='Density'>
# Angela's solution
plt.figure(dpi=200)
# By default the distribution estimate includes a negative death rate!
plt.xlim(0, 40)
# fill=True replaces the deprecated shade=True
sns.kdeplot(df_monthly_before.pct_deaths,
            fill=True,
            clip=(0,100),
            )
sns.kdeplot(df_monthly_after.pct_deaths,
            fill=True,
            clip=(0,100),
            )
plt.title('Est. Distribution of Monthly Death Rate Before and After Handwashing')
plt.show()
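Why does an unclipped KDE show negative death rates at all? A Gaussian kernel density estimate places a small bell curve on every observation, and the curves centred near zero spill over to the left. The hand-rolled sketch below illustrates this with NumPy. It is not Seaborn's implementation, the helper name gaussian_kde_at is hypothetical, and the bandwidth of 1.0 is an arbitrary assumption.

```python
import numpy as np

def gaussian_kde_at(points, samples, bandwidth):
    """Evaluate a simple Gaussian kernel density estimate at the given points."""
    points = np.asarray(points, dtype=float)[:, None]
    samples = np.asarray(samples, dtype=float)[None, :]
    z = (points - samples) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)  # average the bell curves of all samples

# A few post-handwashing monthly rates (percent) from the table earlier.
rates = [2.19, 1.19, 1.86, 0.68, 0.00, 0.65]

# Evaluate the estimate at an impossible rate (-1%) and a plausible one (+1%).
density = gaussian_kde_at([-1.0, 1.0], rates, bandwidth=1.0)

print(density)
```

The density at -1% is small but not zero, which is the weakness the clip parameter works around by restricting where the estimate is evaluated.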
Use a T-Test to Show Statistical Significance¶
Challenge: Use a t-test to determine if the differences in the means are statistically significant or purely due to chance.
If the p-value is less than 1%, the difference in the average monthly death rate is statistically significant at the 99% confidence level: a difference this large would be very unlikely if handwashing had made no difference.
- Import stats from scipy
- Use the .ttest_ind() function to calculate the t-statistic and the p-value
- Is the difference in the average proportion of monthly deaths statistically significant at the 99% level?
import numpy as np
from scipy import stats
stats.ttest_ind(df_monthly_before.pct_deaths,df_monthly_after.pct_deaths)
TtestResult(statistic=5.859656107395003, pvalue=6.51184766477347e-08, df=96.0)
# The p-value is far below 1%, so the difference in the average monthly
# death rate is statistically significant at the 99% level: handwashing
# is associated with a much lower chance of dying in childbirth.
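The result above reports df=96, which equals 76 + 22 - 2 and is consistent with the pooled-variance (Student's) form of the test that ttest_ind uses by default. To make that calculation less of a black box, here is a hand-written sketch of the pooled t-statistic using only the standard library. The function name pooled_t_statistic and the toy inputs are illustrative, not part of SciPy.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Return (t, degrees of freedom) for a two-sample pooled-variance t-test."""
    n_a, n_b = len(a), len(b)
    dof = n_a + n_b - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err, dof

# Toy samples (hypothetical values) chosen so the arithmetic is easy to follow.
t_stat, dof = pooled_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])

print(t_stat, dof)
```

The two periods here have quite different spreads (standard deviations of 5.62 and 1.45), so passing equal_var=False to ttest_ind, which runs Welch's test instead, would be a reasonable robustness check.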
What do you conclude from your analysis, Doctor? 😊
# That I am Speed. Look above 😸